This section clarifies the motivation behind the project, what we are going to investigate, and how we will proceed.
The past couple of years have not been kind to any of us. The COVID-19 pandemic spread havoc, increasing human suffering, destabilizing the global economy, and upending the lives of billions of people around the globe. Through these troubling times, the importance of health and well-being, and its larger impact on the global machinery and on people's day-to-day lives, became clearer to everyone.
Since, as a team, we all wanted to build an impactful project providing meaningful insights that could help people make effective changes in their lives, we decided to delve deeper into the subject of health and well-being. After a lot of brainstorming, we decided to analyze the Motor Vehicle Collisions - Crashes dataset from New York.
Road accidents are responsible for 1.3 million deaths and 50 million injuries annually worldwide. Road crashes are also the leading killer of children and young people aged five to 29. The COVID-19 pandemic killed about 6 million people in about two years; we gained some control over the disease through the diligent efforts of scientists and doctors to understand it better. The same needs to happen with global road safety. The pain of families losing their loved ones is excruciating, and we need to put in the effort to better understand road safety so that everyone's loved ones return safely home from their journeys.
This was our drive to work on this project, as we all feel this is one of the most pointless ways to die. This project also addresses Sustainable Development Goal 3, "Good Health and Well-being", more specifically target 3.6: reduce road injuries and deaths. As things stand, road accidents are set to cause an estimated 13 million further deaths and 500 million injuries during the next decade, according to the UN. Road accidents are largely preventable, and our priority is to investigate their causes so that preventive measures can be implemented.
Car collisions have countless other downsides as well (e.g. they disrupt traffic flow, cost resources, waste fuel, and take lives). Other than homicides, the fatal incidents that bring police into the most contact with the public are fatal traffic collisions. So there is huge potential in building a traffic safety model that helps detect causes of collisions early and supports systemic measures to prevent these unfortunate events.
The goal for the end user's experience of this analysis is, first of all, to give an in-depth understanding of the data. Most people hold expectations and assumptions about these patterns: the most accident-prone hours, night driving, bad weather conditions, driver negligence, and so on. We therefore want to provide a detailed overview of how the data is distributed along different time perspectives.
In particular, we want to show how the distributions have changed over the years, especially in contrast with the COVID-19 year (2020), when road activity decreased significantly. This may also provide insights into accidents from a population density perspective, where many would expect more accidents per capita in more crowded regions.
Moreover, we want to show to what extent we can predict accidents: their number, how many people die, where the next accident happens, and the most relevant causes. Finally, we want to give the user an understanding of how the weather impacts these patterns, as an interesting perspective on the subject.
The dataset was extracted from NYC OpenData. It is 384 MB in size, with 1.88M rows and 29 variables.
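Given the file size, it can help to load only the columns an analysis needs and parse dates up front. A minimal sketch using a tiny in-memory sample (the column names follow the dataset's schema; the two rows are invented for illustration):

```python
import io
import pandas as pd

# A tiny synthetic sample standing in for the 384 MB file
# (column names from the real schema, rows invented).
sample = io.StringIO(
    "CRASH DATE,CRASH TIME,BOROUGH,LATITUDE,LONGITUDE\n"
    "04/14/2021,5:32,BROOKLYN,40.68358,-73.97617\n"
    "04/13/2021,21:35,QUEENS,40.7,-73.8\n"
)

# Restricting columns and parsing dates at load time keeps memory use
# well below a naive read_csv of all 29 columns.
df_small = pd.read_csv(
    sample,
    usecols=["CRASH DATE", "CRASH TIME", "BOROUGH", "LATITUDE", "LONGITUDE"],
    parse_dates=["CRASH DATE"],
    dtype={"BOROUGH": "category"},
)
print(df_small.shape)  # (2, 5)
```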
In Part 2: Basic stats we will learn more about our dataset and get familiar with it.
!pip install pandas_profiling==3.2.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import random as ran
import collections
import math
import functools
import operator
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from pprint import pprint
from sklearn.model_selection import RandomizedSearchCV
from sklearn import tree
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, FactorRange, Legend, HoverTool, VBar
from bokeh.plotting import figure, output_file, show
from bokeh import palettes
from bokeh.sampledata.periodic_table import elements
from bokeh.transform import dodge, factor_cmap
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
plt.style.use('fivethirtyeight') # For better style
plt.rcParams["font.family"] = "DejaVu Sans"
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, plot_confusion_matrix
from datetime import date
from datetime import datetime
import seaborn as sns
import calendar
import holidays
from xgboost import XGBRegressor
import chart_studio
import chart_studio.tools as tls
import chart_studio.plotly as py
from folium import plugins
from folium.plugins import HeatMap
import folium
from pandas_profiling import ProfileReport
import plotly.graph_objects as go
from plotly.subplots import make_subplots # creating subplots
from sklearn.metrics import r2_score
from dateutil.relativedelta import relativedelta
from datetime import timedelta
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV, Lasso
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, BaggingRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
def push_viz(fig, name):
    username = 'tato'  # your username
    api_key = 'g6OEy1bn0ffMLRNu0Wvq'  # your api key - go to profile > settings > regenerate key
    chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
    # Push your visualization to your account:
    py.plot(fig, filename=name, auto_open=True)
    tls.get_embed('https://plotly.com/~tato/1/')
def push_viz2(fig, name):
    username = 'talaaz'  # your username
    api_key = 'djesapBwV1FfBzb35N6l'  # your api key - go to profile > settings > regenerate key
    chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
    # Push your visualization to your account:
    py.plot(fig, filename=name, auto_open=True)
    tls.get_embed('https://plotly.com/~talaaz/1/')
df = pd.read_csv("/work/Motor_Vehicle_Collisions_-_Crashes.csv")
The Motor Vehicle Collisions crash dataset contains details on crash events; each row represents one crash. The data covers all police-reported motor vehicle collisions in NYC. The police report (MV104-AN) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage.
The dataset has 29 columns for each data point, mainly: CRASH DATE, CRASH TIME, BOROUGH, ZIP CODE, LATITUDE, LONGITUDE, LOCATION, ON STREET NAME, CROSS STREET NAME, OFF STREET NAME, NUMBER OF PERSONS INJURED, NUMBER OF PERSONS KILLED, NUMBER OF PEDESTRIANS INJURED, NUMBER OF PEDESTRIANS KILLED, NUMBER OF CYCLIST INJURED, NUMBER OF CYCLIST KILLED, NUMBER OF MOTORIST INJURED, NUMBER OF MOTORIST KILLED, CONTRIBUTING FACTOR VEHICLE 1-5 (5 columns of contributing factors for the crash), COLLISION_ID (unique record id), and VEHICLE TYPE CODE 1-5 (vehicle types involved in the crash, like bicycle or car/SUV).
Let's look at some samples of the dataset to get a better understanding of the data.
df.head()
| CRASH DATE | CRASH TIME | BOROUGH | ZIP CODE | LATITUDE | LONGITUDE | LOCATION | ON STREET NAME | CROSS STREET NAME | OFF STREET NAME | ... | CONTRIBUTING FACTOR VEHICLE 2 | CONTRIBUTING FACTOR VEHICLE 3 | CONTRIBUTING FACTOR VEHICLE 4 | CONTRIBUTING FACTOR VEHICLE 5 | COLLISION_ID | VEHICLE TYPE CODE 1 | VEHICLE TYPE CODE 2 | VEHICLE TYPE CODE 3 | VEHICLE TYPE CODE 4 | VEHICLE TYPE CODE 5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 04/14/2021 | 5:32 | NaN | NaN | NaN | NaN | NaN | BRONX WHITESTONE BRIDGE | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4407480 | Sedan | Sedan | NaN | NaN | NaN |
| 1 | 04/13/2021 | 21:35 | BROOKLYN | 11217.0 | 40.68358 | -73.97617 | (40.68358, -73.97617) | NaN | NaN | 620 ATLANTIC AVENUE | ... | NaN | NaN | NaN | NaN | 4407147 | Sedan | NaN | NaN | NaN | NaN |
| 2 | 04/15/2021 | 16:15 | NaN | NaN | NaN | NaN | NaN | HUTCHINSON RIVER PARKWAY | NaN | NaN | ... | NaN | NaN | NaN | NaN | 4407665 | Station Wagon/Sport Utility Vehicle | NaN | NaN | NaN | NaN |
| 3 | 04/13/2021 | 16:00 | BROOKLYN | 11222.0 | NaN | NaN | NaN | VANDERVORT AVENUE | ANTHONY STREET | NaN | ... | Unspecified | NaN | NaN | NaN | 4407811 | Sedan | NaN | NaN | NaN | NaN |
| 4 | 04/12/2021 | 8:25 | NaN | NaN | 0.00000 | 0.00000 | (0.0, 0.0) | EDSON AVENUE | NaN | NaN | ... | Unspecified | NaN | NaN | NaN | 4406885 | Station Wagon/Sport Utility Vehicle | Sedan | NaN | NaN | NaN |
5 rows × 29 columns
As you can see, there are a lot of NaN values in columns like CROSS STREET NAME and OFF STREET NAME. Since our geographical analysis relies only on LATITUDE, LONGITUDE, and BOROUGH (district), we do not need to delete rows with NaN in those street columns; instead, we drop the columns 'ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME', and 'OFF STREET NAME'. Because we keep LATITUDE, LONGITUDE, and BOROUGH, rows with NaN in the coordinate columns are removed.
The dataset has data from 2010 to May 2022. In order to reduce the size of the data, use the most relevant recent information, and contrast road data from non-COVID years with the COVID year, we filter to the years 2018 through 2020. We leave out 2022 too, as it is an incomplete year. We also convert CRASH DATE to a datetime datatype to make processing easier.
# Change the format of the date
df["CRASH DATE"] = pd.to_datetime(df["CRASH DATE"])
df["CRASH DATE"] = df['CRASH DATE'].dt.date
# Filter to crashes from 2018 through 2020
mask = (df['CRASH DATE'] >= pd.to_datetime("2018-01-01").date()) & (df['CRASH DATE'] < pd.to_datetime("2021-01-01").date())
df = df.loc[mask]
# Drop rows with missing location
df = df[df['LATITUDE'].notna()]
df = df[df['LONGITUDE'].notna()]
df['BOROUGH'] = df['BOROUGH'].fillna('')
# Drop the street-name columns we don't use
df = df.drop(columns=['ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME', 'OFF STREET NAME'])
As you can see, there are 5 CONTRIBUTING FACTOR VEHICLE columns and 5 VEHICLE TYPE CODE columns, one pair per vehicle involved in the accident. Let's look at the number of missing values in these columns to better gauge the number of vehicles usually involved in an accident.
df_cf_vtc= df[["CONTRIBUTING FACTOR VEHICLE 1","CONTRIBUTING FACTOR VEHICLE 2","CONTRIBUTING FACTOR VEHICLE 3", "CONTRIBUTING FACTOR VEHICLE 4", "CONTRIBUTING FACTOR VEHICLE 5", "VEHICLE TYPE CODE 1","VEHICLE TYPE CODE 2","VEHICLE TYPE CODE 3", "VEHICLE TYPE CODE 4", "VEHICLE TYPE CODE 5" ]]
print("\n\n Contributing Factor Vehicle Columns Stats \n\n")
print("Total Number of rows with CONTRIBUTING FACTOR VEHICLE 1=nan - ",df["CONTRIBUTING FACTOR VEHICLE 1"].isna().sum() )
print("Total Number of rows with CONTRIBUTING FACTOR VEHICLE 2=nan - ",df["CONTRIBUTING FACTOR VEHICLE 2"].isna().sum() )
print("Total Number of rows with CONTRIBUTING FACTOR VEHICLE 3=nan -",df["CONTRIBUTING FACTOR VEHICLE 3"].isna().sum() )
print("Total Number of rows with CONTRIBUTING FACTOR VEHICLE 4=nan -",df["CONTRIBUTING FACTOR VEHICLE 4"].isna().sum() )
print("Total Number of rows with CONTRIBUTING FACTOR VEHICLE 5=nan -",df["CONTRIBUTING FACTOR VEHICLE 5"].isna().sum() )
print("\n\n Vehicle Type Code Columns Stats \n\n")
print("Total Number of rows with VEHICLE TYPE CODE 1=nan - ",df["VEHICLE TYPE CODE 1"].isna().sum() )
print("Total Number of rows with VEHICLE TYPE CODE 2=nan -",df["VEHICLE TYPE CODE 2"].isna().sum() )
print("Total Number of rows with VEHICLE TYPE CODE 3=nan -",df["VEHICLE TYPE CODE 3"].isna().sum() )
print("Total Number of rows with VEHICLE TYPE CODE 4=nan -",df["VEHICLE TYPE CODE 4"].isna().sum() )
print("Total Number of rows with VEHICLE TYPE CODE 5=nan -",df["VEHICLE TYPE CODE 5"].isna().sum() )
print("\n\n")
print("Total Number of rows with one nan(Either CF or VTC) in column -", df_cf_vtc.isna().any(axis=1).sum())
print("Total Number of rows with atleast one nan in any column -", df.isna().any(axis=1).sum())
print("Total Number of rows in the dataframe -",df.shape[0])
Contributing Factor Vehicle Columns Stats

Total Number of rows with CONTRIBUTING FACTOR VEHICLE 1=nan -   1862
Total Number of rows with CONTRIBUTING FACTOR VEHICLE 2=nan -  86868
Total Number of rows with CONTRIBUTING FACTOR VEHICLE 3=nan - 478342
Total Number of rows with CONTRIBUTING FACTOR VEHICLE 4=nan - 506281
Total Number of rows with CONTRIBUTING FACTOR VEHICLE 5=nan - 512243

Vehicle Type Code Columns Stats

Total Number of rows with VEHICLE TYPE CODE 1=nan -   3714
Total Number of rows with VEHICLE TYPE CODE 2=nan - 118270
Total Number of rows with VEHICLE TYPE CODE 3=nan - 480131
Total Number of rows with VEHICLE TYPE CODE 4=nan - 506619
Total Number of rows with VEHICLE TYPE CODE 5=nan - 512317

Total Number of rows with one nan(Either CF or VTC) in column - 512347
Total Number of rows with atleast one nan in any column - 512347
Total Number of rows in the dataframe - 514542
As can be seen above, the number of rows with at least one NaN across these columns (512,347) is nearly identical to the number of rows where CONTRIBUTING FACTOR VEHICLE 5 is NaN (512,243). Also, the number of NaNs decreases as we go from CONTRIBUTING FACTOR VEHICLE 5 down to CONTRIBUTING FACTOR VEHICLE 1. This suggests that police officers fill in these columns from 1 to 5 depending on the number of vehicles involved in the accident. The same applies to the VEHICLE TYPE CODE columns.
From this insight, we can create two new columns: one with the number of vehicles involved in the accident and one with the combined contributing factors.
Let's first clean the contributing factor columns and map them onto a dictionary for easier combining and processing.
df["CONTRIBUTING FACTOR VEHICLE 2"] = df["CONTRIBUTING FACTOR VEHICLE 2"].replace("Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle")
df["CONTRIBUTING FACTOR VEHICLE 3"] = df["CONTRIBUTING FACTOR VEHICLE 3"].replace("Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle")
df["CONTRIBUTING FACTOR VEHICLE 4"] = df["CONTRIBUTING FACTOR VEHICLE 4"].replace("Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle")
df["CONTRIBUTING FACTOR VEHICLE 5"] = df["CONTRIBUTING FACTOR VEHICLE 5"].replace("Reaction to Other Uninvolved Vehicle", "Reaction to Uninvolved Vehicle")
df_cf= df[["CONTRIBUTING FACTOR VEHICLE 1","CONTRIBUTING FACTOR VEHICLE 2","CONTRIBUTING FACTOR VEHICLE 3", "CONTRIBUTING FACTOR VEHICLE 4", "CONTRIBUTING FACTOR VEHICLE 5" ]]
cf_dict= dict(enumerate(df["CONTRIBUTING FACTOR VEHICLE 1"].unique()))
inv_map = {v: k for k, v in cf_dict.items()}
df_cf=df_cf.replace({"CONTRIBUTING FACTOR VEHICLE 1": inv_map})
df_cf=df_cf.replace({"CONTRIBUTING FACTOR VEHICLE 2": inv_map})
df_cf=df_cf.replace({"CONTRIBUTING FACTOR VEHICLE 3": inv_map})
df_cf=df_cf.replace({"CONTRIBUTING FACTOR VEHICLE 4": inv_map})
df_cf=df_cf.replace({"CONTRIBUTING FACTOR VEHICLE 5": inv_map})
Now we will aggregate the information in the VEHICLE TYPE CODE and CONTRIBUTING FACTOR VEHICLE columns into two columns: one with the vehicle count and the other with the list of contributing factors.
Vehicle_count=[]
contri_fac=[]
Car_killed=[]
Car_injured=[]
df_check=df.isnull()
for index, row in df_check.iterrows():
    c = 0
    ck = df.loc[index, 'NUMBER OF PERSONS KILLED'] - df.loc[index, 'NUMBER OF PEDESTRIANS KILLED'] - df.loc[index, 'NUMBER OF CYCLIST KILLED'] - df.loc[index, 'NUMBER OF MOTORIST KILLED']
    ci = df.loc[index, 'NUMBER OF PERSONS INJURED'] - df.loc[index, 'NUMBER OF PEDESTRIANS INJURED'] - df.loc[index, 'NUMBER OF CYCLIST INJURED'] - df.loc[index, 'NUMBER OF MOTORIST INJURED']
    Car_killed.append(ck)
    Car_injured.append(ci)
    # Count the vehicles involved (False in df_check means the cell is not null)
    if row['VEHICLE TYPE CODE 1'] == False:
        c = c + 1
    if row['VEHICLE TYPE CODE 2'] == False:
        c = c + 1
    if row['VEHICLE TYPE CODE 3'] == False:
        c = c + 1
    if row['VEHICLE TYPE CODE 4'] == False:
        c = c + 1
    if row['VEHICLE TYPE CODE 5'] == False:
        c = c + 1
    Vehicle_count.append(c)
    # Collect the encoded contributing factors present in this row
    cf = []
    if row['CONTRIBUTING FACTOR VEHICLE 1'] == False:
        cf.append(int(df_cf.loc[index, 'CONTRIBUTING FACTOR VEHICLE 1']))
    if row['CONTRIBUTING FACTOR VEHICLE 2'] == False:
        cf.append(int(df_cf.loc[index, 'CONTRIBUTING FACTOR VEHICLE 2']))
    if row['CONTRIBUTING FACTOR VEHICLE 3'] == False:
        cf.append(int(df_cf.loc[index, 'CONTRIBUTING FACTOR VEHICLE 3']))
    if row['CONTRIBUTING FACTOR VEHICLE 4'] == False:
        cf.append(int(df_cf.loc[index, 'CONTRIBUTING FACTOR VEHICLE 4']))
    if row['CONTRIBUTING FACTOR VEHICLE 5'] == False:
        cf.append(int(df_cf.loc[index, 'CONTRIBUTING FACTOR VEHICLE 5']))
    contri_fac.append(sorted(list(dict.fromkeys(cf))))
df.insert(1, "NUMBER OF PERSONS IN CARS KILLED", Car_killed)
df.insert(1, "NUMBER OF PERSONS IN CARS INJURED", Car_injured)
df.insert(1, "Vehicle Count", Vehicle_count)
vtc= [ "VEHICLE TYPE CODE 1","VEHICLE TYPE CODE 2","VEHICLE TYPE CODE 3", "VEHICLE TYPE CODE 4", "VEHICLE TYPE CODE 5" ]
df =df.drop(columns=vtc)
df.insert(1, "Combined Contributing Factors", contri_fac)
cf= ["CONTRIBUTING FACTOR VEHICLE 2","CONTRIBUTING FACTOR VEHICLE 3", "CONTRIBUTING FACTOR VEHICLE 4", "CONTRIBUTING FACTOR VEHICLE 5" ]
df =df.drop(columns=cf)
# ki=['NUMBER OF PEDESTRIANS INJURED','NUMBER OF PEDESTRIANS KILLED','NUMBER OF CYCLIST INJURED','NUMBER OF CYCLIST KILLED','NUMBER OF MOTORIST INJURED','NUMBER OF MOTORIST KILLED']
# df =df.drop(columns=ki)
del(df_cf)
del(df_cf_vtc)
Now we have the two new columns with vehicle count and combined contributing factors. By doing the above aggregation we were able to eliminate a lot of missing values and managed to condense the columns into useful information.
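As a side note, the row-wise `iterrows` loop above can also be expressed with vectorized pandas operations, which run much faster on half a million rows. A minimal sketch on a toy frame (column names taken from the dataset, data invented):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the vehicle-type columns (NaN = no such vehicle in the crash).
toy = pd.DataFrame({
    "VEHICLE TYPE CODE 1": ["Sedan", "Sedan", "Taxi"],
    "VEHICLE TYPE CODE 2": ["Sedan", np.nan, "Bike"],
    "VEHICLE TYPE CODE 3": [np.nan, np.nan, "Sedan"],
})
vtc_cols = [c for c in toy.columns if c.startswith("VEHICLE TYPE CODE")]

# Vehicle count = number of non-null vehicle-type cells per row,
# replacing the per-row if-chain in the loop above.
toy["Vehicle Count"] = toy[vtc_cols].notna().sum(axis=1)

# The persons-in-cars columns would likewise be plain column arithmetic, e.g.
# df["... IN CARS KILLED"] = df["... PERSONS KILLED"] - df["... PEDESTRIANS KILLED"] - ...
print(toy["Vehicle Count"].tolist())  # [2, 1, 3]
```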
To further increase ease of use, we preprocess the data to create columns like Year, Month, Time, Day of Week Number, Day of Week, Hour of the Day, Time of Day (as a fraction of the day passed), and Hour of the Week in a separate dataframe. This makes plotting and processing time series over different periods easier, and using a new dataframe keeps the original one untouched.
crash_data = df.copy()  # work on a copy to keep the original dataframe untouched
## Create column of datetime object
crash_data['DateTime'] = pd.to_datetime(crash_data['CRASH DATE'])
## Create columns for year, weekday, hour, etc.
crash_data['Year'] = crash_data['DateTime'].dt.year
crash_data['Time'] = pd.to_datetime(crash_data['CRASH TIME'], format='%H:%M')
crash_data['DayOfWeekNumber'] = crash_data['DateTime'].dt.weekday
crash_data['DayOfWeek'] = crash_data['DateTime'].dt.day_name()
crash_data['hourofday'] = crash_data['Time'].dt.hour
crash_data['minute'] = crash_data['Time'].dt.minute
crash_data['timeofdaypercent'] = crash_data['hourofday'] + (crash_data['minute'] / 60)
crash_data['month'] = crash_data['DateTime'].dt.month_name()
crash_data['hourofweek'] = crash_data['hourofday'] + (crash_data['DayOfWeekNumber'] * 24)
crash_data['weekofyear'] = crash_data['DateTime'].dt.isocalendar().week
## Given order=
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df_weekday= crash_data.groupby('DayOfWeek').size().reset_index(name="Count")
df_weekday['DayOfWeek'] = pd.Categorical(df_weekday['DayOfWeek'], categories=weekday, ordered=True)
df_weekday = df_weekday.sort_values('DayOfWeek')
df_hourofday= crash_data.groupby('hourofday').size().reset_index(name="Count")
df_hour_of_week= crash_data.groupby('hourofweek').size().reset_index(name="Count")
df_weekofyear= crash_data.groupby('weekofyear').size().reset_index(name="Count")
df_year= crash_data.groupby('Year').size().reset_index(name="Count")
df_week_year= crash_data.groupby(['Year','weekofyear']).size().reset_index(name="Count")
df_month_year= crash_data.groupby(['Year','month']).size().reset_index(name="Count")
df_month_year['month'] = pd.Categorical(df_month_year['month'], categories=months, ordered=True)
df_month_year = df_month_year.sort_values('month')
df_month= crash_data.groupby('month').size().reset_index(name="Count")
df_month['month'] = pd.Categorical(df_month['month'], categories=months, ordered=True)
df_month = df_month.sort_values('month')
As noted above, the police report (MV104-AN) is required for collisions where someone is injured or killed, or where there is at least $1000 worth of damage (https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf).
# profile = ProfileReport(df, title="Pandas Profiling Report")
# profile
# profile.to_file("basic_stats.html")
In this section, we will analyse the data in more detail. We start by investigating the temporal patterns of the data, exploring the number of accidents by day of week, hour of the day, month, hour of the week, and year. Then we look at the distribution of accidents across different regions of the city on a heatmap. After that, we look at accidents across districts in terms of how many people died or were injured, broken down into motorists, pedestrians, and cyclists. We further provide an interactive visualization to explore the districts' death and injury tolls.
In this first part of the data analysis, we'll look into the temporal patterns of the data to investigate questions like: At which hours of the day do most accidents happen? Are weekends safer than weekdays? Does the month or season matter?
To answer these questions, we use visualizations of aggregate accident counts over time to make patterns more noticeable. Plotting counts of deaths and injuries produced similar figures, reaffirming the common-sense expectation that more accidents lead to more people dying and getting injured; we show one such plot for comparison in the accidents-by-day-of-week section. We believe the visualizations make it easier to gauge patterns over time than analyzing raw numbers would.
The first pattern we investigate is the hourly pattern within a day: how many accidents happen at each hour of the day, aggregated over the entire three-year time frame.
fig = px.bar(df_hourofday, x='hourofday', y="Count", title='Number of accidents by hour of day', color="hourofday", text='Count')
fig.update_layout(showlegend=False)
fig.update_xaxes(title='hours of day')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Hour: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyHourofDay') #pushing the plot to plotly servers
The visualization above reveals some interesting results and reaffirms much of what we expected. The number of accidents is low in the night hours, except for a minor peak at midnight; this is probably because people are asleep and fewer people are on the roads. The hours 16-17 form the global peak, when most accidents of the day happen. This could be due to several factors: peak traffic, tiredness after a long day, a rush to get home, and so on. We also found that, contrary to popular belief, fewer accidents happen in the night hours in absolute terms, though this largely reflects lower traffic volumes rather than night driving being safer per trip.
In this section we explore the accident count on a weekly time scale. First we look simply at the number of accidents by day of week, aggregated over all years; then we look at the accident count by the hour of the week in which it happened.
First, let's inspect the accident count by day of week:
fig = px.bar(df_weekday, x='DayOfWeek', y="Count", title='Number of accidents by Weekday', color='DayOfWeek',text='Count')
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Weekdays')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Day: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyweekday') #pushing the plot to plotly servers
Here we clearly see that fewer accidents happen on the weekends. The accident numbers are fairly consistent from Monday to Thursday but peak on Friday. Our theory is that this is because people are most rushed to get home from work on this day. If this is true, we should see more accidents on Fridays between 15 and 17.
In order to test our theory and look for further weekly patterns, we split the week into 168 hours, starting from 00:00 on Monday as hour 0 and ending with the last hour of Sunday as hour 167.
Lets loook at the accident count by hour of the week:
fig = px.bar(df_hour_of_week, x='hourofweek', y="Count", title='Number of accidents by hour of week', color="hourofweek",text="Count")
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Hour of week')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Hour of week: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyhourofweek') #pushing the plot to plotly servers
Looking at the visualization above, we do see a peak of 5966 accidents in hour 112 of the week, which corresponds to 16h on Friday. The second maximum is also on Friday, at 17h, with 5924 accidents. The third maximum is on Thursday at 17h with 5812 accidents, followed by Thursday at 16h with 5733 accidents.
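To make the index-to-day conversion easy to check, here is a small helper that inverts the hour-of-week encoding (Monday 00:00 = hour 0); the function name is ours:

```python
DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']

def decode_hour_of_week(h):
    """Map an hour-of-week index (0-167, Monday 00:00 = 0) back to (day, hour)."""
    return DAYS[h // 24], h % 24

print(decode_hour_of_week(112))  # ('Friday', 16)
```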
These results indicate that Friday between 16h and 18h is the most dangerous time to drive, followed by the same hours on the other working days. Drivers should therefore be especially careful and show more patience on the road at these times, particularly on Fridays. The high accident numbers and impatience stem from high traffic volumes, which reiterates the need to stay calm when under stress.
Now we enlarge the time perspective a bit and look at monthly and seasonal patterns. We want to address the question of whether the month or season has an impact on how many accidents there are. To analyze this, let's first look at the number of accidents by month of the year.
fig = px.bar(df_month, x='month', y="Count", title='Number of accidents by month', color="month", text='Count')
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Months')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Month: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbymonth') #pushing the plot to plotly servers
This visualization shows differences in accident counts between months. Most accidents happen in January. There is a slight dip in February, though that could simply be because February has fewer days. There is a significant dip in April, which is surprising. The end of the year, November and December, also has fewer accidents, possibly because fewer people drive in the winter snow.
Let's look at the accident count by week of the year to dig deeper into the monthly patterns:
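`df_weekofyear` presumably relies on a week-number extraction; a brief sketch using pandas' ISO calendar accessor (dates are illustrative only):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime(['2020-01-01', '2020-12-28']))
# isocalendar().week yields the ISO week number (1-53).
week_of_year = ts.dt.isocalendar().week.astype(int)
print(week_of_year.tolist())  # [1, 53]
```

Note that ISO weeks near year boundaries can belong to the neighbouring year's week numbering, which is worth keeping in mind when reading the late-December bars.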
fig = px.bar(df_weekofyear, x='weekofyear', y="Count", title='Number of accidents by week of year', color="weekofyear",text="Count")
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Week of year')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Week of year: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyweekofyear') #pushing the plot to plotly servers
From the visualization, it appears that the last two weeks of December have very few accidents. This could be due to the holiday season, with people being relaxed and more considerate of other drivers. The last week of the year has by far the fewest accidents, which could additionally be because people are at home with their families and there is less traffic on the roads.
We also observe numbers rising from the first week of January and peaking around mid-January. Among other reasons, this could be a result of people being more rushed and more ambitious about their new year goals.
The most surprising feature is the dip around weeks 13-15. This could be attributed to the spring and Easter holidays. Another factor that might be skewing the April numbers is the strict lockdown announced in April 2020 due to COVID-19. To investigate this further, we look at yearly patterns in the next section.
Finally, we want to investigate whether there are any patterns in the number of accidents over the years. Here we are especially keen to see the effect of COVID-19 and the lockdowns on the number of accidents, in contrast with the non-lockdown years. For this we first visualize the number of accidents over the three years to see if any pattern emerges.
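A hedged sketch of how `df_year` could be aggregated, with illustrative rows and an assumed column name:

```python
import pandas as pd

# Toy stand-in for the crash data; the column name is an assumption.
crashes = pd.DataFrame({'CRASH DATETIME': pd.to_datetime(
    ['2018-03-01', '2019-06-15', '2020-04-01', '2020-07-04'])})

df_year = (crashes['CRASH DATETIME'].dt.year
           .value_counts().sort_index()   # count per year, in chronological order
           .rename_axis('Year')
           .reset_index(name='Count'))
print(df_year)
```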
fig = px.bar(df_year, x='Year', y="Count", title='Number of accidents by year', color="Year", text='Count')
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Year')
fig.update_yaxes(title='Number of accidents')
fig.update_traces(hovertemplate='Year: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyyear') #pushing the plot to plotly servers
From the plot we observe a decrease in the number of accidents from 2018 to 2019, in line with a general improvement in road safety over the years. Unsurprisingly, 2020 saw a huge drop, with the number of accidents almost halving. This was most likely due to the COVID-19 pandemic, with people staying at home, working remotely, and living under lockdowns.
Let's investigate further by looking at monthly patterns across the years:
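The two-key aggregation behind `df_month_year` can be sketched as follows (column names and sample rows are assumptions):

```python
import pandas as pd

# Toy stand-in for the crash data; the column name is an assumption.
crashes = pd.DataFrame({'CRASH DATETIME': pd.to_datetime(
    ['2019-01-05', '2019-01-20', '2020-01-10', '2020-04-02'])})

df_month_year = (crashes
                 .assign(Year=crashes['CRASH DATETIME'].dt.year,
                         month=crashes['CRASH DATETIME'].dt.month_name())
                 .groupby(['Year', 'month'], as_index=False)
                 .size()                              # one row per (Year, month)
                 .rename(columns={'size': 'Count'}))
print(df_month_year)
```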
fig = px.bar(df_month_year, x='month', y="Count", title='Number of accidents by month by year', facet_col="Year", color="month", text='Count', barmode="group")
fig.update_layout(showlegend=False)
fig.update_xaxes(title='Months')
fig.update_traces(hovertemplate='Month: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbymonthbyyear') #pushing the plot to plotly servers
There is a lot to unpack in the plot above. Firstly, 2020, the COVID year, shows relatively low activity over the entire year compared to the other two years. Secondly, it is important to note that there is always a dip in the number of accidents in April, which is surprising; this could be attributed to the Easter holidays, but we are not sure. Thirdly, February sees a dip, probably due to its lower number of days, except in 2020, where the March lockdown pushed the following months even lower. Lastly, and most importantly, for each month there is a general decrease in the number of accidents across the years, which is a good sign that people are becoming more aware of road safety.
Finally, let's look at the number of accidents by week across the years, to better see weekly deviations in accident patterns over these three years in New York:
fig = px.bar(df_week_year, x='weekofyear', y="Count", title='Number of accidents by week of year by year', color="weekofyear",facet_col="Year",text="Count")
fig.update_layout(showlegend=False,barmode='group')
fig.update_xaxes(title='Week of year')
fig.update_traces(hovertemplate='Week of year: %{x} <br>Number of accidents: %{y}')
fig.show()
push_viz2(fig,'accidentsbyweekofyearbyyear') #pushing the plot to plotly servers
From the above plot, we first point out the dip starting around week 12 and bottoming out in week 15 of 2020, i.e. around the end of March and April. This was the time when details about the new virus were emerging and lockdowns were being imposed, so it is reassuring that the effect is visible in the data as well.
Secondly, we see a peak around weeks 25-26, which is around June, when the weather is good and more people are out on the roads. Even during the pandemic year, higher levels of activity can be seen in the summer months.
Lastly, we observe a dip in December, probably due to the holiday season. From these visualizations, it seems the holidays are a good time for road safety, perhaps due to more relaxed states of mind and a higher than usual desire to reach loved ones safely.
del(df_month)
del(df_weekday)
del(df_hour_of_week)
del(df_month_year)
del(df_week_year)
del(df_year)
del(df_weekofyear)
In this part of the data analysis, we look into the geographical patterns of the data in New York to investigate questions such as where accidents cluster and how districts differ in their injury and death tolls.
To answer these questions, we use a heat map to look at the distribution of accidents across different regions of the city. After that we look at accidents across districts in terms of how many people died or were injured, broken down into motorists, pedestrians, and cyclists killed and injured. We also provide an interactive visualization to explore the districts by their death and injury tolls.
We believe these visualizations will make it easier for users to compare districts, depending on where they live, work, or commute through, and on their mode of transport.
So the first pattern we want to investigate is the distribution of accidents across the city. To examine this, we visualize a heat map based on the latitude and longitude of the accidents, aggregated over the entire time period; here we simply want to see where accidents happen if we set aside the dimension of time for a bit. Let's look at the heat map in the cell below:
map_hooray = folium.Map(location=[40.7128, -74.0060],tiles="Stamen Toner",zoom_start = 10.5)
# Ensure you're handing it floats
df['LATITUDE'] = df['LATITUDE'].astype(float)
df['LONGITUDE'] = df['LONGITUDE'].astype(float)
# List comprehension to build our list of [lat, lon] pairs
heat_data = [[row['LATITUDE'], row['LONGITUDE']] for index, row in df.iterrows()]
# Plot it on the map
HeatMap(heat_data, min_opacity=0.20, blur=12, radius=15, max_zoom=1,
).add_to(map_hooray)
map_hooray.save("heatmap.html")
# Display the map
map_hooray
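One caveat about the cell above: iterating the full frame with `iterrows()` is slow, and rows with missing coordinates would corrupt the heat map. A hedged alternative on a toy frame, building the pairs from the underlying arrays instead:

```python
import pandas as pd

# Illustrative frame with a couple of missing coordinates.
df = pd.DataFrame({'LATITUDE': [40.71, None, 40.78],
                   'LONGITUDE': [-74.00, -73.95, None]})

# Drop rows with missing coordinates, then build the [lat, lon] pairs directly
# from the underlying arrays rather than iterating row by row.
heat_data = df[['LATITUDE', 'LONGITUDE']].dropna().values.tolist()
print(heat_data)  # [[40.71, -74.0]]
```

The resulting list can be passed to `HeatMap` exactly as before.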